Abstract

Many computational problems in image processing, signal processing, and scientific 
computing are naturally structured for either pipelined or parallel computation. When 
mapping such problems onto a parallel architecture it is often necessary to aggregate 
an obvious problem decomposition. Even in this context the general mapping problem 
is known to be computationally intractable, but recent advances have been made in 
identifying classes of problems and architectures for which optimal solutions can be 
found in polynomial time. Among these, the mapping of pipelined or parallel computa- 
tions onto linear array, shared memory, and host-satellite systems figures prominently. 
This paper extends that work first by showing how to improve existing serial mapping
algorithms. Our improvements have significantly lower time and space complexities: in
one case we reduce a published $O(nm^3)$ time algorithm for mapping m modules onto n
processors to an $O(nm \log m)$ time complexity, and reduce its space requirements from
$O(nm^2)$ to $O(m)$. We then reduce run-time complexity further with parallel mapping
algorithms, based on these improvements, that run on the architectures for which they
create mappings.


This research was supported in part by the National Aeronautics and Space Administration under NASA
contract NAS1-18107 while the author was in residence at ICASE, Mail Stop 132C, NASA Langley Research
Center, Hampton, VA 23665.
1 Introduction


Many computational problems in image processing, signal processing, and scientific com- 
puting are naturally structured for either pipelined or parallel computation. It is common 
for an obvious problem decomposition to have more components, or “modules”, than there
are processors. We must then map the computation by aggregating modules. The general
mapping problem is known to be intractable, but recent advances have been made by
Bokhari [4] in identifying classes of problems and architectures for which optimal solutions
can be found in polynomial time. Among these types of computations, a set of mod- 
ules configured as a chain figures prominently. Unless otherwise stated, all references to 
Bokhari’s work refer to [4]. 

As pointed out by Bokhari, the problem of mapping module chains onto different types 
of architectures frequently arises in image and signal processing applications; it may also 
arise in the parallel solution of partial differential equations. The concept of “module” 
can be quite general. For example, a signal processing application may require a signal to 
be Fourier-transformed, massaged in the frequency domain and then inverse-transformed. 
Each stage may be viewed as a module, or a stage may be subdivided into a sequence of 
modules. In an image processing context we may find similar processing stages for every 
frame of data. A common means of numerically solving a partial differential equation 
(PDE) in parallel is to decompose the PDE domain into strips[12]. The computation asso- 
ciated with a strip is the collection of all grid point updates required for points within the 
strip. The communication requirements between strips give this computation a chain-like
structure. At a given iteration, all strips may be updated in parallel, with communication
occurring at the iteration’s end. Grids may be irregular, giving strips different execution
weights. A viable means of balancing the workload is to decompose the domain into many 
more strips than there are processors, and then aggregate them into equi-weighted super- 
strips. Modules are also easily identified in a computation described by a directed acyclic 
graph (DAG) whose nodes describe computations, and whose arcs define data dependen- 
cies. The “level” of a DAG node u is the smallest number of nodes on a path from any 
source node (no incoming arcs) to u; the collection of all nodes at a given level can
constitute a module. It is not immediately obvious that a chain structure between modules
should result, since a node in module k (equivalently, at level k) may depend on a node
in module i < k - 1. However, we can create a chain-like communication arrangement if
we require every module j to transmit all its results to module j + 1 and to transfer any
results received from module j - 1 which are to be used by modules k > j. We can then
pipeline multiple independent invocations of the DAG computation. 

Current parallel architectures compel us to at least consider chain decompositions. For 
example, the CMU Warp [1] is a linear array of high-powered processors, so that pipelining
sequential modules is a natural solution approach. It can be advantageous to use chains
even if the communication topology is rich. For example, the Intel iPSC (hypercube) has
very high communication startup costs which are nearly independent of the message size.
Better performance is sometimes seen by minimizing the number of messages, rather than
the message volume [14]. The performance-conscious programmer again is encouraged 
to limit the interconnection structure of the problem decomposition; a chain offers the 
simplest of useful structures. 

Let $M_1, M_2, \ldots, M_m$ denote a chain of m modules which may be executed concurrently.
As we have described, the modules may form a pipeline of computations or may describe a
parallel computation whose communication requirements are local. The mapping problem
under the contiguity constraint is to assign each $M_i$ to one of n processors in such a way
that the set of modules assigned to a processor forms a contiguous subchain of $M_1, \ldots, M_m$.
The problem becomes non-trivial when we allow the modules to have individual execution
times (called module weights), and require an explicit communication cost for mapping
$M_i$ and $M_{i+1}$ onto different processors. A processor’s time during the computation is
spent either executing a module, communicating results, or waiting for results so that it
can continue. Under any mapping there will be at least one “bottleneck” processor that
limits the computational rate. We seek the mapping which minimizes the execution and
communication time of the associated bottleneck processor. Bokhari gives polynomial- 
time algorithms for optimally mapping a chain onto a linear array of processors, mapping 
a chain onto a shared memory machine, and mapping a collection of chains onto a system 
consisting of a central host with a number of attached satellite processors. 

Bokhari solves these problems with a layered graph. A graph node at layer i describes 
one possible assignment of modules to the ith processor. Layer i has a node for every 
possible assignment. Edges exist only between nodes in adjacent layers, and are always 
rooted in the layer with smaller index. An edge leaving a node is labeled with the cost 
of the associated processor assignment. Edges are defined so that every path through the 
graph describes a legal mapping, and the edges on that path can be analyzed to give the 
mapping’s cost. A least-cost path algorithm is employed to find the optimal mapping. 
His algorithms map a chain onto a linear array in $O(nm^3)$ time and space, onto a shared
memory machine in $O(nm^3 \log m)$ time¹ and $O(nm^3)$ space, and map a set of n chains
onto a host with n satellites in $O(nm^2 \log m)$ time and $O(nm^2)$ space.

The value of m can be quite large for applications whose modules are fine-grained. In
such cases an $O(nm^3)$ algorithm is unattractive. This is especially true since the mathematical
model we employ to assess a mapping’s cost is quite simple, and ignores architectural
details which may impact the accuracy of the model. Accepting that a simple mathematical
model of the mapping problem is still desirable, it is important then to find ways
to reduce the complexity of the approach. Iqbal [7] does so by considering approximation
algorithms that find a solution guaranteed to be within $\epsilon$ of the true optimal. Letting $W_T$
denote the sum of all module weights, his method finds the minimal approximate solution
to the linear array problem in $O(nm \log(W_T/\epsilon))$ time, to the shared memory problem
in $O(m^2 \log(W_T/\epsilon))$ time², and to the host-satellite problem in $O(nm \log(W_T/\epsilon))$ time.

¹ All logarithms in this paper are base 2.

² Iqbal incorrectly claims $O(m \log(W_T/\epsilon))$ for this solution.
These methods are attractive alternatives to Bokhari’s, provided the user can accept the
possibility of failing to find the precise optimal solution. Another drawback is that the
complexity of Iqbal’s method is sensitive to the actual values of the module weights and
to the degree of accuracy desired.

Bokhari’s methods and Iqbal’s methods both rely on a “probe” function which finds an 
optimal solution, subject to some constraint. The probe function is repeatedly called, varying
the constraint, until an optimal solution is discovered. In § 3 we outline Bokhari’s
solutions, and show how they are all easily improved by a factor of m by reducing the 
complexity of his probe function. We then examine each problem, and show how to reduce 
the complexities of their respective probe functions, how to reduce the cost of organizing 
the set of probe calls, and how to achieve low expected parallel time complexities by ex- 
ecuting the mapping algorithms on the target architectures. These algorithms’ expected 
complexities are based on the assumption that all module weights are independent samples 
of a common unspecified distribution, and that all communication delays are independent 
samples of a different unspecified distribution. 

In § 4 we reduce the time complexity of Iqbal’s probe function from $O(nm)$ to $O(n \log m)$.
The improvement requires only the additional assumption that communication costs are
bounded. Then we exploit the problem’s structure and reduce the cost of organizing
the probe search values from $O(m^2 \log m)$ to $O(m \log m)$. The resulting algorithm has
$O(nm \log m)$ time complexity and $O(m)$ space complexity. Finally, we organize the algorithm
for execution on the linear array itself. The parallel algorithm has an $O(m \log m \log n)$
time complexity, and $O(nm)$ space complexity.

In § 5 we reduce the time complexity of a probe function based on Kernighan’s algorithm [8]
from $O(m^2)$ to $O(m \log m)$. Coupled with the search organization developed for the linear
array problem, we reduce this problem’s time complexity from Bokhari’s $O(nm^3 \log m)$ to
$O(m^2 \log m)$. Our algorithm has $O(m^2)$ space complexity. We then parallelize our solution
in three ways. One method achieves an $O((m^2/n) \log m \log n)$ time and $O(nm)$ space
complexity; a second achieves an $O((m^2/n) \log m)$ time and $O(m^2)$ space complexity. The
third is appropriate when $8n^3 < m$, and has an expected $O((m^2/n) \log m)$ time and $O(m)$
space complexity.

Finally, in § 6 we use the results of § 4 to reduce the solution time complexity from
Bokhari’s $O(nm^2 \log m)$ to an $O(\max\{nm \log n, n \log^2 m\})$ time complexity. Our algorithm
has O(nm) space complexity. We then parallelize the algorithm for execution on the host- 
satellite architecture. When m is sufficiently larger than n the parallel algorithm has an 
O(nm) time complexity which is within a constant factor of optimal when the problem is 
loaded serially. 

The trade-offs between communication costs and load balance have recently been ad- 
dressed by a few researchers. Berger and Bokhari in [2] propose and analyze binary dissec- 
tion of a two-dimensional domain with irregular workload. The solutions they construct 
need not be optimal. A similar problem for finite-element solution methods was studied 
by Sadayappan and Ercal in [13]. Cvetanovic in [6] examines mapping, communication,
and granularity issues in an abstract setting. Foundational work for the parallel mapping
problem was laid by study of the distributed mapping problem. The seminal works in
this field include papers by Stone [17], [16]; by Bokhari [5], and by Towsley [18]. Bokhari
summarizes much of this work in [3]. 

2 Model Definitions 

We suppose that a computational problem has been decomposed into m modules $M_1, \ldots, M_m$.
These modules may be defined by function, e.g. fast Fourier transform, convolution; they
may also be some partition of a data domain, as in the solution of partial differential
equations. We imagine that one execution of $M_i$ needs data from $M_{i-1}$, $M_{i+1}$, or both. We
suppose that a module $M_i$ will be executed many times, each execution requiring $w_i > 0$
time on one of a set of n homogeneous processors; the modules are concurrent because
either results are being pipelined, or the modules are loosely synchronized and exchange
the necessary data at the conclusion of every iteration. Our expected complexity analysis
will assume that each $w_i$ is drawn independently from a common distribution having finite
mean $\mu_w$ and standard deviation $\sigma_w$.

We are interested primarily in situations where m is large and $n \ll m$, e.g., n = 10
and m = 1000. One reason for this focus is that algorithms we develop are somewhat
more complex than existing ones; for small m the existing algorithms are likely to be fast 
enough for practical use; conversely, for large m they are impractical. A second reason 
is that using the parallel processors to compute the mapping of another computation 
imposes additional overhead, and becomes an attractive option only if the problem size is 
large enough to overcome that overhead. 

At the end of $M_i$’s execution period there is data available for consumption by $M_{i-1}$
and/or $M_{i+1}$. If one of these modules (say $M_{i-1}$) is assigned to the same processor as $M_i$,
we assume that the next invocation of $M_{i-1}$ can access that data without additional cost.
If one of these modules (say $M_{i+1}$) is assigned to a different processor, then $M_i$’s processor
must explicitly send data over a communication channel, and we will say that the logical
link between $M_i$ and $M_{i+1}$ is exposed. The cost of that communication is assumed to
depend on the communicating modules. Exposing the link between $M_i$ and $M_{i+1}$ causes
both modules to incur a delay cost $C_i > 0$ which models all overhead a processor suffers in
sending and receiving messages over that link. We make the reasonable assumption that
$C_i < C$ for some constant $C$ which is independent of m. Our expected complexity analysis
assumes that each $C_i$ is drawn independently from a common distribution having finite
mean $\mu_c$ and standard deviation $\sigma_c$.

We let $S_{ij} = \sum_{k=i}^{j} w_k$ denote the sum of module weights on the subchain delimited by
$M_i$ and $M_j$. $S_{ij}$ is a single processor’s module evaluation time cost of being assigned the
subchain. The incorporation of the associated delay costs $C_{i-1}$ and $C_j$ will depend on the
architecture considered, as shown below.

Linear Array: Consider a linear array of processors $P_1, P_2, \ldots, P_n$; processor $P_i$ has a
direct communication link with only processors $P_{i-1}$ and $P_{i+1}$. We assume that a processor
is not free to proceed with computation if it is actively engaged in communication. If
modules $M_i$ through $M_j$ are assigned to processor $P_k$, then $P_k$’s execution time during one
iteration is $C_{i-1} + S_{ij} + C_j$. The cost of a complete mapping is the maximum processor
execution time among all processors; we have called the processor defining this maximum 
the bottleneck processor. If the system we map is a pipeline, then the bottleneck processor 
limits the rate of results, and the mapping cost is the time required to obtain one result 
from a full pipeline. If the system is parallel rather than pipelined the mapping cost is the 
time required by each iteration. In either case we optimize performance by minimizing the 
mapping cost. 
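
The linear array cost just defined can be sketched in a few lines. This is an illustration only, not code from the paper: the function name `linear_array_cost`, the 1-based padding of `w`, and the convention that the chain's outer links cost nothing (`c[0] = c[m] = 0`) are assumptions made for the example.

```python
def linear_array_cost(w, c, ends):
    """Bottleneck cost of a contiguous mapping onto a linear array.

    w[1..m] are module weights (w[0] is unused padding) and c[0..m] are
    link costs. Assumption: c[0] = c[m] = 0, since the chain has no
    neighbours outside itself. ends lists the last module of each
    processor's subchain, in order, and must finish with m.
    """
    cost, first = 0, 1
    for last in ends:
        s_ij = sum(w[first:last + 1])                      # S_ij
        cost = max(cost, c[first - 1] + s_ij + c[last])    # C_{i-1} + S_ij + C_j
        first = last + 1
    return cost
```

For example, with weights 4, 2, 3, 5 and link costs 1, 2, 1, splitting the chain after modules 1 and 3 gives processor times 5, 7 and 6, so the mapping cost is 7.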

Shared Memory Machine: Consider a collection of identical processors that commu- 
nicate through a shared memory. The communication medium is a shared resource, so 
that it is appropriate to model communication overhead by adding the costs of all exposed 
links. The cost of a mapping is the maximum of (i) the sum of all communication costs 
on exposed links, and (ii) the maximum processor module evaluation costs under the as- 
signment. This model presumes that communication can be overlapped with computation, 
but that the communication medium serializes the communication traffic. 
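
A corresponding sketch for the shared memory model follows. Again the function name `shared_memory_cost` and the 1-based list layout are assumptions for illustration; only the two-part maximum comes from the model described above.

```python
def shared_memory_cost(w, c, ends):
    """Cost of a contiguous mapping under the shared memory model.

    w[1..m] are module weights (w[0] is unused padding); c[j] is the
    cost of exposing the link between modules j and j+1. ends lists the
    last module of each processor's subchain and must finish with m.
    """
    m = len(w) - 1
    comm = 0      # (i) serialized traffic: sum over all exposed links
    load = 0      # (ii) largest per-processor module evaluation cost
    first = 1
    for last in ends:
        load = max(load, sum(w[first:last + 1]))
        if last < m:              # the link after the final module is never exposed
            comm += c[last]
        first = last + 1
    return max(comm, load)
```

The same mapping can therefore be limited either by the busiest processor or by the shared medium, depending on how many links are exposed.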

Host-Satellite Machine: Consider a powerful host machine which has n satellite processors.
This arrangement might be appropriate when there are n sensors with attached
micro-processors. There is a chain-like pipelined computation associated with each satellite.
Without loss of generality we assume that the chain for satellite $P_i$ has m modules,
$M_{i1}, \ldots, M_{im}$. Satellite $P_i$ can unload some subchain $M_{ij}$ through $M_{im}$ onto the host at
the cost of an inter-module communication which is suffered by both host and satellite
(keeping the whole subchain on the satellite gives a communication cost $C_{i(m+1)}$). Unloading
work onto the host also has the effect of increasing the host’s computational load. The
host’s cost of a mapping is the sum of (i) any load it must always perform, e.g. combina- 
tion of fully processed sensor data, (ii) the sum of module execution times of all satellite 
modules it has received, and (iii) the communication costs associated with each satellite. 
A satellite’s execution time is its module evaluation costs plus its host communication cost. 
An assignment’s cost is the maximum of host cost and maximal satellite execution cost. 
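
The host-satellite cost can be sketched in the same way. The function name `host_satellite_cost` and, in particular, the indexing of the link costs by the number of modules a satellite keeps are assumptions chosen for this example rather than the paper's notation.

```python
def host_satellite_cost(weights, comm, keep, host_load=0):
    """Cost of a host-satellite assignment.

    weights[i][k] is the weight of satellite i's (k+1)-th module.
    comm[i][j] is the host link cost when satellite i keeps its first j
    modules (an indexing convention assumed here). keep[i] is that j.
    host_load is work the host must always perform.
    """
    host = host_load
    worst_satellite = 0
    for w_i, c_i, j in zip(weights, comm, keep):
        host += sum(w_i[j:]) + c_i[j]          # unloaded modules plus link cost
        worst_satellite = max(worst_satellite, sum(w_i[:j]) + c_i[j])
    return max(host, worst_satellite)
```

Moving modules to the host shortens the satellite's time but lengthens the host's, which is the trade-off the mapping must balance.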

The following section sketches Bokhari’s approach to solving these mapping problems, 
and points out an easy improvement to his algorithms. 


3 Layered Graph Path Algorithms 

Bokhari solves the linear array problem by finding the minimum path through a specially
created layered graph. The graph has a source node <s> and a sink node <t>. Each
layer corresponds to a processor. Layer i contains a node for every legal means of assigning
modules to processor i. For example, node <j,k> at layer i represents the assignment
of modules $M_j$ through $M_k$ to processor i. Each layer contains $O(m^2)$ nodes. An edge is
directed from node <j,k> in layer i to any node of the form <k+1,l> in layer i + 1.
<s> directs an edge to every node at layer 1, and every node in layer n directs an edge to
<t>. Consequently, any path from <s> to <t> corresponds to an assignment which
satisfies the contiguity constraint. Figure 1 illustrates Bokhari’s own example; while an
assignment path is shown, many edges are not shown in order to relieve visual congestion.
The layered graph assumes that every processor receives at least one module.

Figure 1: Layered Graph for Linear Array Problem, 9 modules, 4 processors

An edge out of node <j,k> at layer i is labeled with the value $C_{j-1} + S_{jk} + C_k$.
It is possible to include a dependence on i here to model heterogeneous processors and
communication links; for simplicity we assume homogeneity. The cost of a path is the
value of the maximally weighted edge on the path, which clearly is the time required by
the bottleneck processor to solve its portion of the problem. A standard least-cost path
algorithm finds the optimal mapping in O(graph edges) time, in this case $O(nm^3)$.

Least-cost paths through layered graphs are also at the heart of Bokhari’s shared-memory
and host-satellite problem solutions. Here he develops a general technique of
analyzing Sum-Bottleneck graphs. An edge e on such a graph has a sum-weight and a
bottleneck-weight. The cost of a given path through the graph is the maximum of (i)
the sum of all sum-weights on the path’s edges, and (ii) the maximum bottleneck-weight
among the path’s edges. The path with minimal cost is found by first identifying all
unique bottleneck-weight values, and by sorting them. Then a binary search on the list
of bottleneck-weight values is performed: for each bottleneck-weight value b visited, a
shortest path routine TESTPATH(b) is called. TESTPATH(b) treats any edge whose
bottleneck-weight value is greater than b as non-existent. If there is a path from source to
sink on this edge-reduced graph, then TESTPATH(b) returns the path whose sum of
sum-weights is minimal. If there is no path between source and sink, TESTPATH(b) returns
the null path whose cost is defined to be $\infty$. Defining S(b) to be the length of the path
returned by TESTPATH(b), the binary search seeks the smallest bottleneck value b such
that b > S(b). The optimal sum-bottleneck solution is then either b or S(b'), where b' is the
greatest bottleneck value less than b. For each of the layered graphs considered a call to
TESTPATH(b) has complexity O(graph edges).

The sum-bottleneck graph for the shared-memory problem is topologically equivalent
to that for the linear array problem. An edge directed out of node <j,k> is labeled
with bottleneck-weight $S_{jk}$ and sum-weight $C_k$. Each call to TESTPATH has complexity
$O(nm^3)$; the algorithm’s $O(nm^3 \log m)$ complexity follows from the observation that there
are $O(m^2)$ unique bottleneck values, and hence $O(\log m)$ calls to TESTPATH.

The sum-bottleneck graph for the host-satellite problem again associates a layer with
a processor. Node <j> at layer i represents the mapping of satellite $P_i$’s first j modules
onto the satellite, with the remaining modules being mapped onto the host. A node at
layer i directs an edge to every node at layer i + 1. An example of this graph is shown in
figure 2. The bottleneck weight on an edge directed out of node <j> in layer i is the sum
of weights of modules $M_{i1}$ through $M_{ij}$, plus the communication cost $C_{ij}$. The sum weight
on that edge is the sum of $M_{i(j+1)}$ through $M_{im}$ weights with the communication cost $C_{ij}$.

To account for an initial host load H, every edge directed out of the source node has a
sum weight of H and a bottleneck weight of zero. Each call to TESTPATH has $O(nm^2)$
time complexity. There are possibly nm unique bottleneck values, giving an $O(nm^2 \log m)$
overall complexity.

The least-cost path algorithm underlying these solutions exploits the fact that the graph
is layered: for node v at layer i, the least-cost path from the source to v through node u
at layer i - 1 must include the least-cost path from the source to u. In fact, this is just a
statement of the principle of optimality. The algorithm finds the least-cost paths from the
source to all nodes at layer i - 1 before computing any least-cost path to a node at layer
i. The least-cost path to v is found by examining every u which directs an edge to v and
then extending the least-cost path to u with the u → v edge. The least-cost extension is
the least-cost path to v. As we have previously stated, the complexity of this approach
is proportional to the number of graph edges. A simple trick will reduce the number of
graph edges without affecting path costs. For the linear array and shared memory problem
graphs we add n - 2 layers, one between each of the previous layers (except between layers
1 and 2 where an additional layer provides no benefit). Each new layer has m nodes,
labeled 1 through m. To avoid confusion we will refer to the “ith” layer in the new graph
as being identical to the ith layer in the original graph. Node <j,k> in layer i directs a
single edge to node <k> in the new layer between layers i and i + 1; this edge is labeled
exactly as before. Node <k> in the new layer in turn directs an edge to every node of
the form <k+1,l> in layer i + 1; every such edge is labeled with weight zero. Figure 3
illustrates the new graph. Again, many nodes and edges are not shown in order to avoid
congestion. It is clear that any path from source to sink still defines a legal assignment and
has a weight identical to that of the corresponding path in the original graph. The number
of edges drops from $O(nm^3)$ to $O(nm^2)$, reducing the complexity of both the linear array
and shared memory problems by a factor of m.

Figure 2: Layered Graph for the Host-Satellite Problem
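
One way to see the effect of the interposed layers is as a dynamic program in which each new node <k> stores the best bottleneck cost of placing modules 1 through k on the processors considered so far. The sketch below is an illustration under assumptions, not the paper's code: the function name, the 1-based padding of `w`, and the convention `c[0] = c[m] = 0` are hypothetical. Like the layered graph, it gives every processor at least one module, so it expects n to be no larger than m.

```python
import math

def min_bottleneck_linear_array(w, c, n):
    """Minimum bottleneck cost of mapping modules 1..m onto exactly n processors.

    w[1..m] are module weights (w[0] is padding); c[0..m] are link costs
    with c[0] = c[m] = 0 (assumed). Runs in O(n * m^2) time.
    """
    m = len(w) - 1
    prefix = [0] * (m + 1)                   # prefix[k] = S_1k
    for k in range(1, m + 1):
        prefix[k] = prefix[k - 1] + w[k]
    # prev[k]: value held by interposed node <k>, i.e. the best bottleneck
    # for modules 1..k on the processors handled so far.
    prev = [math.inf] * (m + 1)
    prev[0] = 0
    for _ in range(n):
        cur = [math.inf] * (m + 1)
        for k in range(1, m + 1):
            for j in range(1, k + 1):        # this processor receives M_j..M_k
                load = c[j - 1] + prefix[k] - prefix[j - 1] + c[k]
                cur[k] = min(cur[k], max(prev[j - 1], load))
        prev = cur
    return prev[m]
```

Each layer performs one pass over the $O(m^2)$ pairs (j, k), which matches the edge count of the improved graph.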

We treat the host-satellite assignment graph similarly. Between layers we interpose a
single node. Every node at layer i directs a single edge to the node between layers i and
i + 1; the edge is weighted as before. The node between layers i and i + 1 directs an edge
to every node in layer i + 1. The two weights on each such edge are zero. Once again,
every path identifies an assignment and its cost; by reducing the number of graph edges
by an order of m we reduce the algorithm’s cost by an order of m. This same trick can be
applied to the algorithms in [5] and [18]³.

Bokhari does not discuss parallelization of his methods on the target architectures.
Even after improvement, his linear array solution is very ill-suited for parallelization on
the array. A natural approach is to partition the solution graph, and require every processor
to compute the least cost path to the nodes it is assigned. The computation proceeds
in stages: find the least-cost paths to layer 2 nodes, then layer 3 nodes, etc. It is not
difficult to see, however, that the communication requirements of this approach are enormous:
there is communication across at least one link for every graph edge cut by the
partition. Furthermore, if the nodes are distributed evenly among processors, then $\Omega(m^2)$
values will have to be broadcast between each of n - 1 steps. The communication complexity
alone is equivalent to the complexity of a serial solution. The method just described
might work well on a shared memory machine, provided that the number of processors is
small, and that the communication network is fast relative to the processor speeds. The
cost model assumes serialized communication, so again we have an $O(nm^2)$ communication
complexity. These observations also apply to a host-satellite system if the satellites
can communicate through the host’s memory.

³ Private communication from Shahid Bokhari.

Figure 3: Improved Layered Graph for Linear Array Problem, 9 modules, 4 processors

If we have a computation which is decomposed into a very large number of modules, 
and if we desire to take advantage of the parallel hardware our mapping methods tar- 
get, then Bokhari’s methods leave room for improvement. In the following sections we 
discuss improved serial algorithms, and give parallel mapping algorithms based on these 
improvements. 
4 Linear Array Problem

Bokhari’s method for solving the linear array problem does not rely on a probe in the
same way that his shared-memory and host-satellite solutions do. Our approach is based
on that of Iqbal [7], who developed a probing approach for finding an approximate solution.
Like Bokhari’s sum-bottleneck method we will probe the space of bottleneck values. Our
improvements stem from increasing the efficiency of the probe method, and from exploiting
the problem structure to avoid the cost of sorting all bottleneck values. The subsections
to follow discuss these improvements, and show how to parallelize the algorithm for execution
on the linear array.

4.1 An Improved Probe Function 

Our method is based on Iqbal’s probe function PROBE1(w), which is shown in figure 4.
PROBE1(w) determines whether it is possible to assign the workload so that
every processor’s execution time is less than or equal to the bottleneck constraint w.
PROBE1(w) iteratively chooses a feasible subchain load for the “next” processor. Given
that a processor’s subchain begins with module $M_i$, PROBE1(w) finds that j such that
(i) $\Omega_{ij} = C_{i-1} + S_{ij} + C_j \le w$, and (ii) the remaining unassigned load $\Lambda_j = C_j + S_{(j+1)m}$
is minimized. Iqbal proves that this rule will find an assignment whose cost is no greater
than w, if one exists.

In the worst case, for every processor assignment PROBE1(w) will consider making
module $M_j$ (j > n) a subchain right endpoint. PROBE1(w) always considers making $M_j$
an endpoint on every iteration where $M_j$ is still unassigned. This gives PROBE1(w) an
$O(nm)$ complexity.
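
The greedy rule can be sketched in Python as follows. This is written in the spirit of Iqbal's probe rather than transcribed from it: the function name `probe1`, the 1-based padding of `w`, and the convention `c[0] = c[m] = 0` are assumptions for the example.

```python
def probe1(w_max, w, c, n):
    """Return True if some contiguous mapping onto at most n processors
    keeps every processor's time C_{i-1} + S_ij + C_j at or below w_max.

    w[1..m] are module weights (w[0] is padding); c[0..m] are link costs
    with c[0] = c[m] = 0 (assumed).
    """
    m = len(w) - 1
    suffix = [0] * (m + 2)                   # suffix[j] = w[j] + ... + w[m]
    for j in range(m, 0, -1):
        suffix[j] = suffix[j + 1] + w[j]
    i = 1
    for _ in range(n):
        best_j, best_rem = None, None
        load = c[i - 1]
        for j in range(i, m + 1):            # scans every remaining endpoint
            load += w[j]                     # now C_{i-1} + S_ij
            remaining = c[j] + suffix[j + 1] # load left over if the subchain ends at j
            if load + c[j] <= w_max and (best_rem is None or remaining < best_rem):
                best_j, best_rem = j, remaining
        if best_j is None:
            return False                     # no feasible subchain for this processor
        if best_j == m:
            return True
        i = best_j + 1
    return False
```

The inner scan over all remaining modules, repeated for each of the n processors, is the source of the $O(nm)$ cost.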

Consider the problem faced by the inner loop of PROBE1(w): among all $j \in [i, m]$
such that $\Omega_{ij} \le w$, find the $j_{\min}$ minimizing $\Lambda_j$. PROBE1 examines the entire interval
$[i, m]$ for this point; instead we appeal to the problem’s structure and quickly find a small
subinterval $[k_{\min}, k_{\max}]$ which must contain $j_{\min}$.

Define the functions $\Lambda_j^{-1} = S_{1j} + C_j$ and $w(i) = w - C_{i-1} + S_{1(i-1)}$, and note that

$$C_{i-1} + S_{ij} + C_j \le w \iff S_{1(i-1)} + S_{ij} + C_j \le w - C_{i-1} + S_{1(i-1)}$$

or equivalently,

$$\Omega_{ij} \le w \iff \Lambda_j^{-1} \le w(i).$$

If we can find the largest j such that $\Lambda_j^{-1} \le w(i)$ we will have found the largest j such that
$\Omega_{ij} \le w$. Let $k_{\max}$ denote this upper bound. $k_{\max}$ can be quickly found with a pre-computed
array right_min, whose jth entry equals k if the minimum value of $\Lambda^{-1}$ over [j, m] occurs
at position k. right_min is computed once in $O(m)$ time, and is thereafter employed by
every probe call. $\Lambda^{-1}_{\mathrm{right\_min}(j)}$ necessarily increases monotonically in j. Given w and i, $k_{\max}$
is simply the greatest index j greater than or equal to i such that $\Lambda^{-1}_{\mathrm{right\_min}(j)} \le w(i)$. If
$k_{\max}$ exists, it can be found in $O(\log m)$ time with a binary search. If the search fails to
find a feasible solution then no solution exists.


Definitions

$W_T$: Sum of all module weights: $W_T = \sum_{i=1}^{m} w_i$
$\Omega_{ij}$: Processor cost if assigned subchain $M_i, \ldots, M_j$: $\Omega_{ij} = C_{i-1} + S_{ij} + C_j$
$\Lambda_j$: Total “remaining” load after assigning $M_j$: $\Lambda_j = C_j + \sum_{k=j+1}^{m} w_k$


function PROBE1(w): Boolean;
{
    i = 1; p = 1; k = 0; Λ_min = W_T;
    while p ≤ n do
    {
        for j = i to m do
            if Ω_ij ≤ w and Λ_j < Λ_min then
            {
                Λ_min = Λ_j;
                k = j;
            }
        Assign subchain M_i, ..., M_k to processor p;
        if k = m then return(true);
        i = k + 1; p = p + 1;
    }
    return(false);
}


Figure 4: Iqbal’s probe function

Having found $k_{\max}$ we can find the lower bound $k_{\min}$. Note first that

$$\Lambda_{k_{\max}} = C_{k_{\max}} + S_{(k_{\max}+1)m}.$$

As j decreases, $S_{jk_{\max}}$ necessarily increases, and eventually exceeds $C_{k_{\max}}$. We choose $k_{\min}$
to be the largest j where this occurs. For any $j < k_{\min}$ we have

$$\Lambda_j = C_j + S_{(j+1)k_{\max}} + S_{(k_{\max}+1)m} > C_{k_{\max}} + S_{(k_{\max}+1)m} = \Lambda_{k_{\max}}.$$

Consequently, any $j < k_{\min}$ may be ignored as a solution. If $k_{\min} < i$, we take $k_{\min} = i$.

Since $S_{jk_{\max}}$ must increase as j decreases, $k_{\min}$ can be found with another binary search,
on the “virtual array” $\ldots, S_{k_{\max}k_{\max}}$. Note that for any i, j, $S_{ij} = S_{1j} - S_{1(i-1)}$, so
that $S_{ij}$ can be computed in constant time if the $S_{1j}$’s are pre-computed. This means that
the virtual array need not be explicitly computed, and the search for $k_{\min}$ requires only
$O(\log m)$ time.

A linear scan for feasible points in $[k_{\min}, k_{\max}]$ will find the feasible point minimizing
$\Lambda$. Since we have assumed that the communication costs are bounded from above by some
constant independent of m, the linear scan takes $O(1)$ time. Figure 5 presents pseudo-code
describing this new $O(n \log m)$ probe function PROBE2(w). Note that a returned value
of false occurs only if for some processor there are no feasible assignments. Like Iqbal’s
probe, PROBE2 will return true if a feasible mapping is found which uses fewer than n
processors.


4.2 Improved Search Organization 

At this point we could simply sort the $O(m^2)$ unique processor loads, and find the smallest
feasible one with $O(\log m)$ calls to PROBE2. This algorithm’s complexity is dominated by
the $O(m^2 \log m)$ complexity of sorting. To further improve the probing approach we will
have to reduce the cost of organizing the search. We do so by replacing the $O(m^2 \log m)$
cost of finding $O(\log m)$ probe values with an $O(m \log m)$ cost of finding $O(m)$ probe values.
Because the probe calls are cheap, increasing their frequency to avoid a sort improves the
overall performance.

For the moment, assume that all communication costs are zero so that every processor’s
execution time is of the form $S_{ij}$. Furthermore, we extend the definition of $S_{ij}$ to allow
$i > j$:

$$S_{ij} = S_{1j} - S_{1(i-1)}.$$

This definition encompasses the earlier one, and also shows that $S_{ij}$ can be computed in
constant time if all sums of the form $S_{1k}$ are known.

We are able to infer that some execution time weights are larger than others, regardless
of the module weight values. In particular, $S_{ij} \le S_{ik}$ whenever $j \le k$, and $S_{ij} \ge S_{kj}$
whenever $i \le k$. This partial ordering is illustrated in figure 6 with a dominance matrix.
Row entries ascend in value from left to right, column entries descend from top to bottom.
By transitivity it follows that $S_{ij} \le S_{uv}$ whenever $i \ge u$ and $j \le v$.

We will call any contiguous portion of a row a strip. On any given strip we can use 
binary search and a probe function to identify the entry with smallest execution time 
weight that satisfies the probe. This observation allows us to eliminate large portions of

function PROBE2(w): Boolean;
{
    i = 1; p = 1; k = 0;
    while p <= n do
    {
        A_min = W_T;
        w(i) = w - C_(i-1) + S_1(i-1);
        Use binary search to find the greatest j
            such that A_right_min(j) <= w(i), and call it k_max;
        If no such k_max exists return(false);
        Use binary search to find the greatest j <= k_max
            such that S_(j,k_max) > C_(k_max), and call it k_min;
        for j = k_min to k_max do
            if Q_ij <= w and A_j < A_min then
            {
                A_min = A_j;
                k = j;
            }
        Assign subchain M_i, ..., M_k to processor p;
        if k = m then return(true);
        i = k + 1; p = p + 1;
    }
    return(false);
}


Figure 5: Improved Probe Function for Linear Array Problem


the search space. Consider a rectangular region of the dominance matrix that is $h$ entries
high and $l$ entries long. Consider the effect of doing a binary search on the strip which best
bisects the rectangle into equal sized pieces. Let $S_{ij}$ be the minimal feasible strip entry
found by the search. Any $S_{uv}$ with $u \le i$ and $v \ge j$ lies above and to the right of $S_{ij}$; any
such entry dominates $S_{ij}$ and may therefore be discarded as a solution possibility. Any
$S_{xy}$ with $x \ge i$ and $y < j$ lies below and to the left of $S_{ij}$; any such entry is dominated by
the value $S_{i(j-1)}$ which is known to have failed. Such entries may also be discarded as a
solution possibility. Since the strip bisects the rectangle into equal sized pieces, one half
of the rectangle’s entries are eliminated by the binary search; the remaining entries fall
into no more than two regions which are again rectangular. These points are illustrated
graphically in Figure 7. In order to find the minimal feasible solution within the rectangle
it suffices to apply this procedure recursively to the remaining rectangles. The recursion
stops when a rectangular region consists only of a strip; then a binary search finds the best
feasible strip solution, if one exists.

Figure 6: Dominance Matrix of $S_{ij}$ values

The efficiency is enhanced if throughout the search we maintain variables $V_f$ and $V_s$.
$V_f$ records the largest execution time tested so far which failed the probe test; $V_s$ records
the smallest execution time tested so far which satisfies the probe. If the search procedure
calls for a value $V$ to be tested, the probe function needs to be called only if $V_f < V < V_s$.
If the probe is called, either $V_f$ or $V_s$ will be updated, depending on the probe outcome. At
the end of the search procedure $V_s$ contains the minimal mapping cost. If the associated
mapping has not been saved, a last call to PROBE2 will create it.
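The rectangle search with the $V_f$/$V_s$ cache can be sketched in a few lines of Python. This is only an illustration under stated assumptions: communication costs are zero (as in the discussion above), module weights are positive, and the probe is a simple greedy feasibility test standing in for PROBE2. The names `make_probe`, `min_feasible_load`, and `test` are hypothetical and do not come from the paper.

```python
from itertools import accumulate

def make_probe(weights, n):
    """Greedy stand-in for the probe (zero communication costs assumed):
    can the chain be cut into at most n contiguous pieces of weight <= w?"""
    def probe(w):
        used, load = 1, 0
        for x in weights:
            if x > w:
                return False
            if load + x > w:              # start filling the next processor
                used, load = used + 1, x
            else:
                load += x
        return used <= n
    return probe

def min_feasible_load(weights, probe):
    """Search the implicit dominance matrix S_ij = S_1j - S_1(i-1).
    Returns (smallest feasible entry, number of real probe calls)."""
    m = len(weights)
    prefix = [0] + list(accumulate(weights))
    S = lambda i, j: prefix[j] - prefix[i - 1]   # 1-based, valid for i > j too

    v_f, v_s, calls = float("-inf"), float("inf"), 0

    def test(v):
        # Call the probe only when V_f < v < V_s; otherwise the answer is known.
        nonlocal v_f, v_s, calls
        if v <= v_f:
            return False
        if v >= v_s:
            return True
        calls += 1
        ok = probe(v)
        if ok:
            v_s = v
        else:
            v_f = v
        return ok

    stack = [(1, m, 1, m)]                       # rectangles: rows r0..r1, cols c0..c1
    while stack:
        r0, r1, c0, c1 = stack.pop()
        if r0 > r1 or c0 > c1:
            continue
        r = (r0 + r1) // 2                       # strip that bisects the rectangle
        lo, hi = c0, c1 + 1
        while lo < hi:                           # leftmost feasible entry on the strip
            mid = (lo + hi) // 2
            if test(S(r, mid)):
                hi = mid
            else:
                lo = mid + 1
        j = lo
        stack.append((r0, r - 1, c0, j - 1))     # above and to the left survives
        stack.append((r + 1, r1, j, c1))         # below and to the right survives
    return v_s, calls
```

For example, with weights `[2, 3, 1]` and two processors the smallest feasible load found is 4 (the mapping `[2] [3, 1]`). The explicit stack replaces recursion only to keep the sketch short; the order in which rectangles are evaluated does not affect the result.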

The lattice search technique calls the probe function more often than a binary search 
over a fully sorted set of bottleneck values, but avoids the high cost of sorting that set. Its 
utility rests in that it calls the probe function only $O(m)$ times, a fact we now demonstrate.

Define a rectangle evaluation to be the process of choosing a strip on a given rectangle, 
finding the minimum strip value satisfying PROBE2 (if any), and identifying the smaller 
rectangles, called children , which must also be evaluated. It is helpful to view the search 
process as a sequence of steps, where step 0 is the initial rectangle evaluation on the entire 
matrix. Step 1 consists of evaluating all children of step 0. In general, the ith step is 
composed of all evaluations of children defined by the previous step. We will say that a 
matrix entry is active at the beginning of the ith step if it lies within some rectangle that 
is evaluated during the ith step. We will also say that an entry is evaluated during the ith 
step if it lies on a strip over which a binary search occurs during the ith step. An evaluated 
entry need not actually be touched by the search. Three observations are key.

• The number of active entries in any matrix column decreases by one half every step.

• The total number of evaluated entries during any step is no greater than $m$.

• The maximum number of rectangles which are evaluated at step $i$ is $2^i$.


To see that the first point is true, consider any column in an evaluated rectangle. If the 
point found by the binary search lies in the column, or in one to the left, then only the 
lower half of the column entries are left active. If the point lies to the column’s right, 
then only the upper half of the column’s entries are left active. The second point follows 
from the observation that during a step, no two evaluated rectangles overlap in any row or 
column coordinates. If we sum the horizontal lengths of all evaluated rectangles the result
is exactly $m$. The third point is obvious, since any rectangle evaluation spawns no more
than 2 children. 

From the first point we infer that there are no more than $\log m$ steps in the search.
The number of PROBE2 calls required is the sum of calls by the binary searches involved.
Because of the concavity of the log operation, the number of calls at a step is maximized
when there are as many binary searches as possible, over short lists. A binary search on a
list of $k$ items requires no more than $\log k + 1$ probes. There are no more than $m$ evaluated
points at a step, and no more than $2^i$ binary searches. The number of probe calls at a step
is consequently bounded from above by $2^i(\log(m/2^i) + 1)$. By summing over all steps, we
find the number of PROBE2 evaluations to be bounded by

$$\sum_{i=0}^{\log m} 2^i\left(\log(m/2^i) + 1\right) = \log m \sum_{i=0}^{\log m} 2^i - \sum_{i=0}^{\log m} i\,2^i + \sum_{i=0}^{\log m} 2^i < 4m.$$

The evaluation of $\sum i\,2^i$ is accomplished using a general formula found in [9]. At the
cost of adopting $O(m)$ probe calls, we avoid the cost of a full sort. There is a payoff.
$O(m)$ calls to an $O(n \log m)$ probe gives an $O(nm \log m)$ algorithm, over the $O(m^2 \log m)$
alternative.
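The $4m$ bound can be checked by evaluating the three sums in closed form. The worked steps below assume, purely for convenience, that $m$ is a power of two and write $L = \log_2 m$; neither assumption is stated in the text.

```latex
\begin{align*}
\sum_{i=0}^{L} 2^i &= 2^{L+1} - 1 = 2m - 1, \\
\sum_{i=0}^{L} i\,2^i &= (L-1)\,2^{L+1} + 2 = 2m(L-1) + 2, \\
\sum_{i=0}^{L} 2^i\bigl(\log_2(m/2^i) + 1\bigr)
  &= (L+1)(2m-1) - \bigl(2m(L-1) + 2\bigr) \\
  &= 4m - L - 3 \;<\; 4m .
\end{align*}
```

For instance, $m = 4$ gives $1\cdot 3 + 2\cdot 2 + 4\cdot 1 = 11 = 4\cdot 4 - 2 - 3$.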

This search technique relies heavily on the lattice-like partial ordering of the dominance 
matrix. Redefining the dominance matrix by replacing each $S_{ij}$ with $Q_{ij} = C_{i-1} + S_{ij} + C_j$
destroys that partial ordering. However, a similar ordering can be discovered in $O(m \log m)$
time with the following observation:

$$\begin{aligned}
C_0 + S_{1j} + C_j \le C_0 + S_{1k} + C_k
  &\iff S_{1j} + C_j \le S_{1k} + C_k \\
  &\iff S_{ij} + C_j \le S_{ik} + C_k \\
  &\iff C_{i-1} + S_{ij} + C_j \le C_{i-1} + S_{ik} + C_k.
\end{aligned}$$

If we were to label each matrix element with its rank within a sorted row, the implications
above say that within a column all such labels are identical. A similar observation holds 
if we label elements with their column sorted rank. By sorting the first row we can create 
an array r where r(i) = j if the ith smallest element of a row is found in the jth column. 
Likewise, by sorting some column we can create an array p, where p(i) = j if the ith largest 
element of a column lies in row j. p and r are created once in O(mlogm) time. Imagine 
now that we create a sorted dominance matrix by physically re-arranging the dominance 
matrix columns so that the rows are ordered, and physically re-arranging the rows so that 
the columns are ordered. The sorted matrix has the desired lattice like partial ordering. 
We can use the same search technique as before on the sorted matrix. It is not necessary 
though to create the sorted matrix. Whenever we need to access the ij element of the 
sorted matrix, we create the p(i)r(j) element of the dominance matrix. 

The $O(m \log m)$ cost of creating $r$ and $p$ is masked by the $O(nm \log m)$ cost of calling
PROBE2 $O(m)$ times. The overall complexity is again $O(nm \log m)$. Even lower complexities
are possible if we employ the linear array itself to solve the mapping problem.
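The implicit sorted matrix can be illustrated with a short Python sketch. It assumes the extended definition $S_{ij} = S_{1j} - S_{1(i-1)}$ is used for all $i, j$, so that every entry $C_{i-1} + S_{ij} + C_j$ splits into a row-only term plus a column-only term. The function name `sorted_matrix_accessor` and the input layout (`comm[k]` holding $C_k$ for $k = 0, \ldots, m$) are hypothetical choices made for this example.

```python
from itertools import accumulate

def sorted_matrix_accessor(weights, comm):
    """weights[k-1] = w_k for k = 1..m; comm[k] = C_k for k = 0..m.
    Returns entry(a, b): the (a, b) element of the sorted dominance
    matrix, computed on demand without building the matrix."""
    m = len(weights)
    prefix = [0] + list(accumulate(weights))

    def q(i, j):
        # Q_ij = C_{i-1} + S_ij + C_j, with S_ij = S_1j - S_1(i-1) for any i, j
        return comm[i - 1] + (prefix[j] - prefix[i - 1]) + comm[j]

    # r[b-1]: column holding the b-th smallest element of a row
    r = sorted(range(1, m + 1), key=lambda j: q(1, j))
    # p[a-1]: row holding the a-th largest element of a column
    p = sorted(range(1, m + 1), key=lambda i: q(i, 1), reverse=True)

    def entry(a, b):
        return q(p[a - 1], r[b - 1])

    return entry
```

Because the two sorts are done once, each later access costs constant time, which is what the search procedure needs.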

4.3 A Parallel Approach 

One approach to parallelizing our serial algorithm is to call the same $O(m)$ probe values
as the serial algorithm, using the linear array to compute PROBE2 in parallel. The
only opportunity for parallelism here is to parallelize the search over $[k_{\min}, k_{\max}]$, and
then combine the individual minimums found by the processors. It takes each processor
$O(\log m)$ time to find the interval endpoints, constant time to find a minimum over its
designated subregion of the interval, and then $\Omega(n)$ time to find the global minimum.
Asymptotically we lose with this scheme: the complexity of a single PROBE2 call is
$O(n \log m + n^2)$.

A different approach is to have each processor perform a set of PROBE2 calls independently,
and in parallel with other processors. The strategy we propose is to decompose the
implicitly sorted dominance matrix into n regions which are assigned to the processors. 
Each processor probes its space to find the optimal assignment within that space; an O(n) 
time combination of results finds the optimal mapping. 

We assume that every processor has enough memory to solve the problem alone. The 
module and communication weights are initially loaded into the processors. Each processor
serially computes its own copy of all sums of the form $S_{1k}$, its own copy of the right_min
array, and its own copy of $r$ and $p$. Each processor is now in a position to probe some
region of the bottleneck space. The geometry of the regions we choose has an impact
on the complexity. An analysis similar to the one presented for the serial case shows
that the number of probe calls required to evaluate an $h \times l$ (where $h \le l$) rectangle is
$O(h + h \log(l/h))$. Under the constraint that $h \cdot l$ is constant, it is not difficult to see that
we want to make $h$ as small as possible. The optimal approach is to assign each processor
an $(m/n) \times m$ region of the sorted dominance matrix. The parallel time complexity is then
the sum of an $O(m \log m)$ cost to load the problem and create auxiliary data structures,
an $O(m \log m \log n)$ cost to perform the searches in parallel, and an $O(n)$ cost to combine
the processors’ individual optimal solutions. The $O(m \log m \log n)$ cost dominates.

5 Shared Memory Problem 

Our approach to the shared memory problem again uses a probe. We first show how to 
reduce the cost of a probe based on Kernighan’s algorithm [8] from $O(m^2)$ to $O(m \log m)$.
We then adopt the same search strategy as we did for the linear array problem and achieve
an $O(m^2 \log m)$ time algorithm. Finally, we discuss three approaches for parallelization.
One approach divides the sorted dominance matrix into regions which are searched in parallel.
This approach yields an algorithm with an $O((m^2/n) \log m \log n)$ time complexity,
and $O(nm)$ space complexity. A second approach uses a parallel sort, and then serialized
binary search and probe calls. This algorithm reduces the expected time complexity to
$O((m^2/n) \log m)$, but increases the space complexity to $O(m^2)$. Our third approach
parallelizes the probe function, and is appropriate when $n \ll m$. Under technical conditions
on $n$ and $m$, its expected time complexity is $O((m^2/n) \log m)$, and its space complexity is
only $O(m)$.

5.1 An Improved Serial Solution 

Iqbal's approximation method cites an algorithm described by Kernighan [8]. The algorithm
partitions a chain of modules, subject to the contiguity constraint, and also subject
to the constraint that the sum of module weights in any partition does not exceed some fixed
and pre-determined value $w$. The cost of a partitioning is the sum of the costs of links
exposed by the partitioning. He formulates this problem using dynamic programming, and
solves the optimality equations

$$V(0) = 0$$

$$V(j) = C_j + \min_{\substack{i \le j \\ S_{ij} \le w}} \{V(i-1)\} \quad \text{for } j = 1, 2, \ldots, m.$$

$V(j)$ can be interpreted as the minimal cost of partitioning modules $M_1$ through $M_j$,
including the cost of separating $M_j$ from $M_{j+1}$. Once $V(m)$ is determined the solution is
found by backtracking. If $j$ defines $V(m)$’s min term, then $j + 1$ is the left endpoint of the
rightmost partition; if $i$ determines $V(j)$’s min term, then $i + 1$ is the left endpoint of the
next partition, and so on.

This function can be used as a probe. If the chain can be partitioned into n or fewer 
pieces subject to the partition loading constraint, then the partition defines a feasible 
mapping; furthermore, it minimizes the sum of communication costs among all mappings 
with processor loads no greater than $w$. The probe compares the sum of communication costs
with the probe constraint $w$; if that sum is smaller, and if $n$ or fewer partition elements
are defined, it returns the value “true”. So long as $w$ is kept fixed for all problem sizes this
solution has $O(m)$ complexity. However, we vary $w$ with every call to the probe function.
In the worst case $w$ is $W_T$, the sum of all module weights, and the algorithm is $O(m^2)$.
Iqbal missed this fact, and in [7] ascribes an $O(m)$ complexity to this algorithm.

Kernighan’s treatment considers $w$ to be constant, so that the min term for every $V(j)$
can be determined in constant time with a linear scan. Since our $w$’s will vary and may
become quite large, we need to avoid linear scans. The min term can be efficiently found
with the aid of a search tree which organizes domain points on the basis of their $V$ values.
The tree initially contains a single record corresponding to the boundary condition $V(0) = 0$.
A pointer where_is(0) to that record is stored to aid a future deletion. Subsequently,
we compute each $V(j)$ by first identifying the indices over which its min term ranges. The
minimal index $i_{\min}$ satisfying $S_{ij} \le w$ can be found with a binary search on $S_{1j}, S_{2j}, \ldots, S_{jj}$,
and the where_is pointers are used to remove all tree records for $V(i)$ with $i < i_{\min} - 1$
(the min term ranges over $V(i - 1)$ for $i \ge i_{\min}$). The
search tree is then examined for the entry whose key is least; this entry defines $V(j)$’s min
term. $V(j)$ is computed by adding the min term and $C_j$. A record representing $V(j)$ is
inserted into the tree, and the pointer where_is($j$) to that record is saved. The auxiliary
value back_ptr($j$) is set equal to the index of the position defining $V(j)$’s min term.

$O(m)$ tree insertions and deletions cost $O(m \log m)$ amortized time using splay trees [15].
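The improved evaluation of the recurrence can be sketched in Python as follows. Note the substitution: a binary heap with lazy deletion stands in for the splay tree and its where_is pointers, which keeps the same $O(m \log m)$ total cost but is not the data structure described above. The function name `min_cut_cost`, the input layout (`comm[k]` holding $C_k$ for $k = 0, \ldots, m$), and the returned list of pieces are assumptions of this example; module weights are assumed non-negative.

```python
import heapq
from bisect import bisect_left
from itertools import accumulate

def min_cut_cost(weights, comm, w):
    """Evaluate V(j) = C_j + min{ V(i-1) : i <= j, S_ij <= w }.
    Returns (V(m), pieces) with pieces as 1-based (left, right) module
    ranges, or None if some single module is heavier than w."""
    m = len(weights)
    prefix = [0] + list(accumulate(weights))      # prefix[k] = S_1k
    V = [0] * (m + 1)
    back = [0] * (m + 1)                          # back[j] = index defining V(j)'s min term
    heap = [(0, 0)]                               # records (V(i), i); starts with V(0) = 0

    for j in range(1, m + 1):
        # Smallest i with S_ij <= w, i.e. smallest i with prefix[i-1] >= prefix[j] - w.
        i_min = bisect_left(prefix, prefix[j] - w, 0, j) + 1
        if i_min > j:                             # module j alone exceeds w
            return None
        # Lazy deletion: discard records that fell out of the window.
        while heap[0][1] < i_min - 1:
            heapq.heappop(heap)
        best, idx = heap[0]
        V[j] = comm[j] + best
        back[j] = idx
        heapq.heappush(heap, (V[j], j))

    pieces, j = [], m                             # serial backtracking
    while j > 0:
        pieces.append((back[j] + 1, j))
        j = back[j]
    return V[m], pieces[::-1]
```

With weights `[2, 3, 1, 4]`, link costs $C_0, \ldots, C_4$ = `[0, 5, 1, 6, 0]`, and $w = 5$, the sketch cuts only the link of cost 1 and returns the pieces `(1, 2)` and `(3, 4)`. Lazy deletion is safe here because the left end of the window never moves backwards as $j$ grows.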

The improved probe function can be used in conjunction with the search strategy 
described for the linear array problem. Note that a dominance matrix with $S_{ij}$ type
entries suffices. Letting $S(b)$ denote the minimized sum of communication costs with $b$
as bottleneck constraint, recall that at the termination of the binary search we will have
determined the smallest bottleneck value $b$ such that $b \ge S(b)$. The optimal sum-bottleneck
solution is then either $b$ or $S(\hat{b})$, where $\hat{b}$ is the greatest bottleneck value less than $b$. Since
$\hat{b}$ may be the solution we seek, it is important to be able to access it quickly. Suppose
that throughout the search we maintain a value $V_n$, the smallest bottleneck value larger
than $V_s$ (the least known feasible solution). We claim that $\hat{b}$ must either be the value
of $V_n$ at the end of the search, or be adjacent to $b$’s location in the sorted dominance
matrix. The claim is established by contradiction: suppose that $\hat{b}$ is not $V_n$ and is not
adjacent to $b$. $\hat{b}$ is eliminated from consideration as the smallest bottleneck exceeding its
associated communication cost in one of two ways. $\hat{b}$ may be eliminated because a smaller
bottleneck value satisfies the probe. This bottleneck value can only be $b$, and would have
to be adjacent to $\hat{b}$, a condition we have assumed does not occur. $\hat{b}$ can also be eliminated
if a larger bottleneck value fails the probe. However, this is impossible because $b$ itself
passes the probe. This establishes the contradiction, and thus the fact that given $b$ and
$V_n$, $\hat{b}$ can be found in constant time.

The cost of $O(m)$ probe calls, each with complexity $O(m \log m)$, is $O(m^2 \log m)$. Note
that this same complexity is achieved if we sort the $O(m^2)$ bottleneck values and call the
probe $O(\log m)$ times. However, the former approach needs $O(m)$ space, while the latter
requires $O(m^2)$ space.

5.2 A Suite of Parallel Approaches

Three different approaches for parallelizing the algorithm suggest themselves. One mimics
our parallel linear array solution, and simply divides the dominance matrix into $(m/n) \times m$
sized regions which are searched in parallel. Each region requires $O((m/n) \log n)$ probe
calls, a cost which dominates the cost of combining the various processors’ optimal solutions.
The overall time complexity of this approach is $O((m^2/n) \log m \log n)$. Each
processor requires $O(m)$ space.

A second approach is to compute and sort the $O(m^2)$ bottleneck values in parallel.
Techniques such as those described in [11] and [19] are appropriate, and have an
$O((m^2/n) \log m)$ expected parallel complexity. A binary search over the sorted values
may then be employed, with a serial probe. $O(\log m)$ probe calls are made, each with
$O(m \log m)$ complexity. The resulting algorithm has an $O(\max\{(m^2/n) \log m,\ m \log^2 m\})$
expected parallel time complexity, but requires $O(m^2)$ space for the sort.

An $O((m^2/n) \log m)$ expected time complexity with $O(m)$ space requirements is possible
in the event that $8n^3 < m$. In this case we can effectively parallelize the probe function.
Our approach relies on the likelihood that if $V(i)$ defines the min term for $V(j)$, then $i \ll j$.
If $V(j)$ does not depend on “nearby” values of $V$, then “nearby” values of $V$ can be computed
in parallel. Of course, if $V(i)$ and $V(j)$ are computed in parallel and it turns out
that $V(j)$’s min term should have been $V(i)$, then we need to recompute $V(j)$. We will
see though that this occurs infrequently under our stochastic assumptions about module
and weight values. It should be noted that unlike the other complexities derived in this
paper, the magnitudes of the constants of proportionality are not obviously low. Without
further discussion on this topic, we note here that when the module weight distribution’s
coefficient of variation $\sigma/\mu$ is low, then the constants of proportionality are low.

A general description of the algorithm follows. We divide the domain into successive
blocks $B_1, B_2, \ldots, B_{\lceil m/n \rceil}$ of $n$ consecutive points each. We will compute all values of $V$
within a block in parallel, assigning one processor per block point. The processors create
and combine information describing the solution of V in the block area, and check to ensure 
that no value computed in the block depends directly on another value within the same 
block. If such a dependency is detected it can be corrected with a serialized computation 
of the block values. Once the block values are correct the processors move on to the next 
block. The backtracking phase to find the optimal partition is serial. We turn next to a 
more detailed description of this procedure. 

The algorithm begins with every processor initializing its own search tree such as was 
used in the serial version. The search tree may reside in the processor’s local memory. The 
global memory will contain the $V$ array. Processor $P_i$ then computes $V(i)$. Since $n \ll m$,
it is unlikely that the probe weight $w$ will be small enough so that $S_{1n} > w$, and it is
highly likely that $V(i) = C_i$ is the correct value for $V(i)$. The processors cooperatively
compute the minimum value $m_1 = \min_{1 \le i \le n}\{V(i)\}$. It is well-known that this can be done
in $\log n$ steps with a combining tree as shown in figure 8(a). The entire tree is left in the
global memory. Note however that communication is serialized, implying that the cost of
building the tree is $O(n)$. Figure 8(b) illustrates the fact that the minimum value of $V$ over
the last $k$ items of a block can always be recovered from the combining tree by examining
no more than $\log n$ entries. If $S_{1n} \le w$ then $m_1$ is the minimum value of $V$ over the first
block. Every processor inserts $m_1$ into its local search tree, and for the purposes of future
deletion records a pointer to its location.
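The recovery of a suffix minimum from the combining tree can be made concrete with a small Python sketch. It is a serial illustration only: it assumes $n$ is a power of two and stores the tree in an implicit array (children of node $t$ at $2t$ and $2t+1$), which the text does not specify. The names `build_min_tree` and `suffix_min` are hypothetical.

```python
def build_min_tree(values):
    """Combining tree over n = len(values) leaves (n assumed a power of two).
    tree[1] is the root; leaves occupy tree[n .. 2n-1]."""
    n = len(values)
    tree = [None] * (2 * n)
    tree[n:] = values
    for node in range(n - 1, 0, -1):
        tree[node] = min(tree[2 * node], tree[2 * node + 1])
    return tree

def suffix_min(tree, k):
    """Minimum over the last k leaves (1 <= k <= n), plus the number of
    tree entries examined along the way."""
    n = len(tree) // 2
    node = 2 * n - k                  # leftmost leaf of the suffix
    best, examined = tree[node], 1
    while node > 1:
        if node % 2 == 0:             # left child: the right sibling's subtree lies in the suffix
            best = min(best, tree[node + 1])
            examined += 1
        node //= 2
    return best, examined
```

In this sketch a query touches one leaf plus at most one sibling per level, so at most $\log_2 n + 1$ entries are examined, in line with the logarithmic bound cited above.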

The computation now proceeds in stages. The values for $B_k$ are computed by the $k$th
stage with the following operations.

1. Serial Step: Note that $B_k$ consists of integers in $[(k-1)n + 1, kn]$. We must first
determine whether it is feasible to compute all of $B_k$’s points in parallel. A necessary
condition for this is that the indices of $V(kn)$’s min term completely encompass $B_k$.
This is checked by determining whether $S_{((k-1)n)(kn)} \le w$. If not, then we cannot
evaluate all of $B_k$’s points in parallel. In this case we serialize the computation of
the block, and advance to the next block.

2. Parallel Step: Processor $P_j$ is responsible for computing $V((k-1)n + j)$. $P_j$ first
uses a binary search to find the left endpoint $i_{\min}(j)$ of the indices over which its
min term is taken. $P_j$ then deletes from its search tree all entries representing blocks
including and lying to the left of $i_{\min}(j)$. Let $i_r(j)$ be the right endpoint of the block
containing $i_{\min}(j)$, and let $v_l(j)$ be the minimum value of $V$ over $[i_{\min}(j), i_r(j)]$. $v_l(j)$
can be found by examining the combining tree over $i_{\min}(j)$’s block.

3. Parallel Step: Processor $P_j$ finds the minimum value $v_s(j)$ within its own search
tree. Then $P_j$ computes $V((k-1)n + j) = C_{(k-1)n+j} + \min\{v_l(j), v_s(j)\}$, and records
in local memory a back_ptr value giving the index which defines $\min\{v_l(j), v_s(j)\}$.

4. Parallel Step: The processors cooperatively compute the minimum value $v_b$ of $V$
over the current block, with a combining tree.

5. Serial Step: $P_n$ checks to see if its current $V$ value is correct, by comparing $V(kn)$
with $C_{kn} + v_b$. If the latter quantity is smaller, then the earlier computation was
incorrect. Because the range of $V(kn)$’s min term includes all of $B_k$, if any $V$ computed
in $B_k$ is incorrect, $V(kn)$ will be incorrect and will be detected. When this
occurs, the block’s points are recomputed serially.

Figure 8: Combining tree to compute the minimum of $n$ values

Over the course of the algorithm, an individual processor inserts, deletes, and searches
for $m/n$ items in the search tree. Collectively this exacts an $O((m/n) \log(m/n))$ amortized
time cost. In the absence of serialization, for each of $m/n$ stages, step (1) takes $O(1)$ time;
noting that communication is serialized, step (2) takes $O(\max\{n \log n, \log m\})$ time; step
(3) takes $O(\log(m/n))$ time; step (4) takes $O(n)$ time due to serialized communication, and
step (5) takes $O(1)$ time. In the absence of serialization the overall complexity depends on
the relationship between $m$ and $n$. If $n \log n > \log m$, then the $O(n \log n)$ cost of step (2)
dominates and the algorithm has an $O(m \log n)$ cost. If $n \log n < \log m$, then the $O(\log m)$
cost of step (2) dominates, yielding an $O((m/n) \log m)$ algorithm. As $m$ grows we expect
that eventually the latter case will hold; for simplicity in exposition we assume that $m$ is
sufficiently larger than $n$ to give an $O((m/n) \log m)$ parallel time complexity in the absence
of serialization.

If the computation is serialized, a shared variable can indicate which processor is allowed 
to compute its value. A processor proceeds as before, except that the minimum value of 
V seen so far within the block must also be considered in step (3). Each point calculation 
takes O(logm) time, so the entire block takes O(nlogm) time. 

Without serialization the parallel complexity of this probe is $O((m/n) \log m)$. Serialization
may occur at step (1) when $w$ is too small in relation to $n$. Because the $m$ modules
must be distributed over only $n$ processors, we expect that each processor receives on the
order of $m/n$ modules, and that the values passed to the probe tend to be from convolutions
of approximately $m/n$ module sums. Intuitively then we see that serialization shouldn’t
occur often, provided that $m$ is sufficiently larger than $n$. The subsection to follow shows
that if $8n^3 < m$ then serialization occurs so infrequently that the expected complexity of
the entire algorithm is $O((m^2/n) \log m)$.

5.3 Expected Complexity When $8n^3 < m$

If we can reduce the frequency of serialization to $O(1/n)$, the contribution of serialization
to the algorithm’s overall complexity will be $O((m^2/n) \log m)$ which is exactly the parallel
complexity. We will show that this occurs when $m$ is sufficiently larger than $n$. We do so
in three steps. First we show that if $w \ge 2n\mu_m$, then the probability of serialization being
required at step (1) of the parallel probe is $O(1/n)$. Secondly, we show that if $w \ge n^2\mu_m/2$,
then the probability of serialization being required at step (5) of the parallel probe is also
$O(1/n)$. Finally, under some simplifying assumptions we show that when $8n^3 < m$, then
probe calls with $w$ values less than $n^2\mu_m/2$ occur so infrequently that the expected cost
due to serialization is only $O((m^2/n) \log m)$.

Consider the parallel probe function. The first chance at serialization occurs in step
(1). Let $P_1(w)$ be the probability that the sum of $n$ module weights associated with a
block exceeds $w$. We assume that every module execution time is drawn independently
from a common distribution with finite mean $\mu_m$ and standard deviation $\sigma_m$. Likewise, we
assume that the communication costs are independent and identically distributed, although
they are allowed to be from a different distribution. Our analysis rests on two facts from
probability theory.

• If $X_1, X_2, \ldots, X_k$ are $k$ independent identically distributed random variables with
mean $\mu$ and standard deviation $\sigma$, then the mean of the linear combination $\sum_{i=1}^{k} a_i X_i$
is $\mu \sum_{i=1}^{k} a_i$ and the standard deviation is $\sigma \sqrt{\sum_{i=1}^{k} a_i^2}$.

• Chebychev’s Inequality: If $X$ is any random variable with mean $\mu$ and standard
deviation $\sigma$, and $\epsilon$ is any positive number, then

$$\mathrm{Prob}\{|X - \mu| \ge \epsilon\sigma\} \le \frac{1}{\epsilon^2}.$$

These facts may be found in any standard probability text, such as [10].

Let $M(n)$ be an $n$-fold convolution of the module weight distribution. $M(n)$ has mean
$n\mu_m$ and standard deviation $\sigma_m\sqrt{n}$. Serialization is chosen at step (1) if the sum of
the block’s $n$ module weights exceeds $w$. Appealing to a slightly re-organized form of
Chebychev’s inequality we have

$$\mathrm{Prob}\{M(n) \ge n\mu_m + \epsilon\sigma_m\sqrt{n}\} \le \frac{1}{\epsilon^2}$$

for any positive constant $\epsilon$. Choosing $w = n\mu_m + \epsilon\sigma_m\sqrt{n}$ and solving for $\epsilon$, we have

$$P_1(w) = \mathrm{Prob}\{M(n) > w\} = \mathrm{Prob}\{M(n) > n\mu_m + \epsilon\sigma_m\sqrt{n}\} \le \frac{n\sigma_m^2}{(w - n\mu_m)^2}$$

whenever $w > n\mu_m$. If $w \ge 2n\mu_m$, then the right hand side of this inequality is $O(1/n)$.
We have proved the following theorem.

Theorem 1 Let $P_1(w)$ be the probability of serialization at step (1). If $w \ge 2n\mu_m$, then
$P_1(w) = O(1/n)$.
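The constant hidden in Theorem 1 can be written out. Starting from the Chebychev bound $P_1(w) \le n\sigma_m^2/(w - n\mu_m)^2$ and substituting the threshold $w \ge 2n\mu_m$, so that $w - n\mu_m \ge n\mu_m$, gives:

```latex
\[
P_1(w) \;\le\; \frac{n\sigma_m^2}{(w - n\mu_m)^2}
       \;\le\; \frac{n\sigma_m^2}{(n\mu_m)^2}
       \;=\; \frac{1}{n}\left(\frac{\sigma_m}{\mu_m}\right)^{2}
       \qquad \text{for } w \ge 2n\mu_m .
\]
```

The constant of proportionality is therefore the squared coefficient of variation of the module weight distribution, which is small when the weights vary little about their mean.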

Now let $P_2(w)$ be the probability that serialization is chosen in step (5). This occurs
when the min term of some $V(j)$ is defined by some $V$ value in $V(j)$’s block. To show that
$P_2(w) = O(1/n)$ when $w \ge n^2\mu_m/2$ we will need the following technical lemma.

Lemma 2 For every $j = 1, 2, \ldots, m$ let

$$L(j, w) = \{V(i) \mid i < j,\ S_{ij} \le w\}.$$

Then for all $j$ and $w$, $\min L(j, w) \ge \min L(j-1, w)$.

Proof Suppose $L(j-1, w) = \{V(i_l), \ldots, V(j-2)\}$ and $L(j, w) = \{V(i_u), \ldots, V(j-1)\}$.
Note that $i_l$ is necessarily no greater than $i_u$. This implies that

$$\min\{V(i_u), \ldots, V(j-2)\} \ge \min L(j-1, w).$$

Now

$$L(j, w) = \{V(i_u), \ldots, V(j-2)\} \cup \{V(j-1)\}$$

so that

$$\begin{aligned}
\min L(j, w) &= \min\left(\{V(i_u), \ldots, V(j-2)\} \cup \{C_{j-1} + \min L(j-1, w)\}\right) \\
             &\ge \min\left(\{V(i_u), \ldots, V(j-2)\} \cup \{\min L(j-1, w)\}\right) \\
             &= \min L(j-1, w).
\end{aligned}$$

□

The main purpose of lemma 2 is to aid in the proof of the following lemma. 

Lemma 3 Let $V(i), V(i+1), \ldots, V(i+N-1)$ be a consecutive sequence of $V$ values.
Then the probability that the minimum value occurs in one of the last $n$ sequence elements
is no greater than $n/N$.

Proof Let $q_j$ be the probability that $V(i+j)$ is the minimum in the sequence. We first
show that

$$q_0 \ge q_1 \ge \cdots \ge q_{N-1}.$$

Consider the module weights to be fixed, but let the communication weights be random.
Let $J = \langle c_i, c_{i+1}, \ldots, c_{i+N-1} \rangle$ be any random vector sampled from the joint distribution
of the communication costs, and suppose that under this joint vector $V(i+k)$ is minimum.
By lemma 2, $\min L(i+k-j, w) \le \min L(i+k, w)$ for all $j$ such that $1 \le j \le k$. Since
$V(i+k-j) = \min L(i+k-j, w) + c_{i+k-j} \ge \min L(i+k, w) + c_{i+k} = V(i+k)$, we must
have $c_{i+k-j} \ge c_{i+k}$. Suppose we swapped the costs $c_{i+k}$ and $c_{i+k-1}$. The swap does not
affect any $V(i+k-j)$ with $j > 1$, but clearly $V(i+k-1) \le V(i+k)$. Furthermore, any
$V$ value to the left of $V(i+k-1)$ is larger, because

$$\min L(i+k-j, w) + c_{i+k-j} \ge \min L(i+k, w) + c_{i+k} \implies
  \min L(i+k-j, w) + c_{i+k-j} \ge \min L(i+k-1, w) + c_{i+k}.$$

Any value to the right of $V(i+k-1)$ must also be larger: the min term for some values
$V(i+k+j)$ to the right of $V(i+k-1)$ may change to either the new value of $V(i+k)$
or $V(i+k-1)$, but the new value of $V(i+k+j)$ cannot be less than the new value of
$V(i+k-1)$. Because the communication costs are independent and identically distributed,
the random vector which swaps the values $c_{i+k-1}$ and $c_{i+k}$ in $J$ has the same probability
mass or density as $J$. Consequently, for any random sample where $V(i+k)$ is minimum
there is an equally likely sample where $V(i+k-1)$ is minimum. As this is true for any
sampling of module execution weights, we must have

$$q_0 \ge q_1 \ge \cdots \ge q_{N-1}.$$

For any descending sequence of $N$ values, it is always true that the sum of the last $n$
elements is no greater than $n$ times the sequence average. The sequence average here is
$1/N$ because the $q_j$’s must sum to 1. The probability that the minimum occurs in one of
the last $n$ positions is the sum of the last $n$ sequence values, and consequently is no greater
than $n/N$.

□
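The last step of the proof of Lemma 3 uses a simple property of descending sequences, spelled out here for completeness. With $q_0 \ge q_1 \ge \cdots \ge q_{N-1}$ and $\sum_j q_j = 1$, every one of the last $n$ terms is no larger than every one of the first $N - n$ terms, so the average of the last $n$ terms cannot exceed the average of the whole sequence:

```latex
\[
\frac{1}{n}\sum_{j=N-n}^{N-1} q_j
  \;\le\; \frac{1}{N}\sum_{j=0}^{N-1} q_j
  \;=\; \frac{1}{N}
\qquad\Longrightarrow\qquad
\sum_{j=N-n}^{N-1} q_j \;\le\; \frac{n}{N}.
\]
```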

At step (5) serialization is required at block $B_k$ if $V(kn)$’s min term is defined by some
value in $B_k$. Lemma 3 tells us that if $L(kn, w)$ has $N$ elements, then the probability of
serialization is no greater than $n/N$. If we can keep the size of $L(kn, w)$ on the order of
$n^2$, then serialization occurs at step (5) with $O(1/n)$ probability. The size of $L(kn, w)$ is
a random variable which we call $N^*(w)$. $P_2(w)$ is no greater than $nE[1/N^*(w)]$, that is,
$n$ times the expected value of $1/N^*(w)$. The theorem to follow bounds this expectation by
$O(1/n)$ in the event that $w \ge n^2\mu_m/2$.

Theorem 4 If $w \ge n^2 \mu_m / 2$, then $P_2(w) = O(1/n)$.

Proof 


$$P_2(w) = \mathrm{Prob}\{\text{one of } B_k\text{'s } V \text{ terms is minimum in } L(kn, w)\}
  \le n\,E[1/N^*(w)] \qquad (1)$$

where the expectation is taken with respect to the distribution of $N^*(w)$. The function
$f(x) = 1/x$ is decreasing, and is bounded from above by $g(x)$, defined below:

$$g(x) = \begin{cases} 1 & \text{if } 1 \le x \le n^2/4 \\ 4/n^2 & \text{if } x > n^2/4 \end{cases}$$

Because $g(x) \ge f(x)$ for all $x \ge 1$, we must have $E[g(N^*(w))] \ge E[1/N^*(w)]$. Now $N^*(w)$ is
less than $n^2/4$ only if the sum of $n^2/4$ or fewer module weight random variables is greater
than $n^2 \mu_m / 2$. The proof of Lemma 1 bounded a very similar probability using Chebyshev's
inequality. Applying the same methodology here, it can be shown that the probability of
$N^*(w)$ being less than $n^2/4$ is $O(1/n^2)$, if $w \ge n^2 \mu_m / 2$. We then have

$$E[g(N^*(w))] = \mathrm{Prob}\{M(n^2/4) > w\} \cdot 1 + \mathrm{Prob}\{M(n^2/4) \le w\} \cdot \frac{4}{n^2}
  = O(1/n^2).$$

Applying this to relation (1), the theorem's conclusion follows.

□ 
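Written as a single chain, the proof combines relation (1), the pointwise bound of $1/x$ by $g(x)$, and the $O(1/n^2)$ estimate of $E[g(N^*(w))]$. The display below only restates steps already made in the proof:

```latex
P_2(w) \;\le\; n\,E\!\left[\frac{1}{N^*(w)}\right]
       \;\le\; n\,E\!\left[g\bigl(N^*(w)\bigr)\right]
       \;=\; n \cdot O\!\left(\frac{1}{n^2}\right)
       \;=\; O\!\left(\frac{1}{n}\right).
```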

Theorems 1 and 4 tell us that if the probe weight $w$ is large enough then serialization
occurs infrequently. We next show that if $m$ is sufficiently larger than $n$ we can expect the
probe weights used by our algorithm to be large enough to satisfy the theorems' conditions.
A note of warning is in order. The results to follow relate to pristine convolutions of the
module weight distributions. The values of $w$ chosen by our search procedure are indeed
sums of module weights, but the distribution of those sums is affected by the history of
the search behavior. For example, suppose we choose to probe with value $S_{ij}$, found in the
upper left rectangle identified by the first rectangle evaluation. $S_{ij}$ is not identically
distributed with a sum of $j - i + 1$ independent module weights. We know that the probe is
satisfied on $S_{ik}$ for some $k > j$; this was established by the first rectangle evaluation. We
also know that for some $k > i$, $S_{kj}$ fails the probe. The former observation tends to make
$S_{ij}$ "larger" probabilistically, because some portion of the chain it represents is involved
with sums known to succeed. Likewise, the latter observation tends to make $S_{ij}$ "smaller"
because the $M_k$ to $M_j$ subchain weight must fail the probe. The effects of the search
behavior on probe value distributions appear to be too complex to deal with analytically.
But because of the conflicting influences on the probe value distribution it seems likely that
these effects on the size of the probe values are second order compared to the effects on
pure module weight convolutions of increasing the size of the sums. By assuming that the
probe weights are drawn from pure module weight convolutions, we can make statements
about the probability of the probe function being satisfied.

The discussion to follow speaks in terms of $w$ being drawn from a convolution of $k$
module weights, where $k$ may vary. We have already used $M(k)$ to denote a $k$-fold
convolution of module weights. To say that $w$ is drawn from that convolution we will write
$w \sim M(k)$. For our purposes three bounds are quite important and are summarized by
the following lemma.

Lemma 5 Let $M(k)$ denote a $k$-fold convolution of module weight random variables. Then

(a) $\mathrm{Prob}\{M(k) \ge 2k\mu_m\} = O(1/k)$.

(b) $\mathrm{Prob}\{M(k) \le k\mu_m/2\} = O(1/k)$.

(c) If $M_1(k)$ and $M_2(2k)$ are independent convolutions, then
$\mathrm{Prob}\{M_1(k) \ge M_2(2k)\} = O(1/k)$.

Proof (a) and (b) are found in a manner entirely similar to the proof of Theorem 1. (c)
is found in the same fashion by first noting that

$$\mathrm{Prob}\{M_1(k) \ge M_2(2k)\} = \mathrm{Prob}\{M_1(k) - M_2(2k) \ge 0\},$$

and that the random difference has mean $-k\mu_m$ and standard deviation $\sigma_m \sqrt{3k}$.

□ 
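For part (c), the Chebyshev step can be spelled out as follows. This is a sketch under the assumption that module weights are independent with mean $\mu_m$ and finite standard deviation, written $\sigma_m$ here; $D$ is a name introduced only for this display.

```latex
D = M_1(k) - M_2(2k), \qquad
E[D] = k\mu_m - 2k\mu_m = -k\mu_m, \qquad
\mathrm{Var}(D) = k\sigma_m^2 + 2k\sigma_m^2 = 3k\sigma_m^2,

\mathrm{Prob}\{D \ge 0\}
  = \mathrm{Prob}\{D - E[D] \ge k\mu_m\}
  \le \mathrm{Prob}\{|D - E[D]| \ge k\mu_m\}
  \le \frac{3k\sigma_m^2}{k^2\mu_m^2}
  = \frac{3\sigma_m^2}{k\,\mu_m^2}
  = O(1/k).
```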

An important component of our search strategy is to call the probe function with
bottleneck value $w$ only if $w$ exceeds $V_f$, the greatest probe value known to fail, and if
$w$ is dominated by the best known solution to date. This test offers protection from
serialized probe calculations when the probe value $w$ touched by the search is small; with
high probability $w < V_f$. Let $l_f$ be the number of modules summed to form the value of
$V_f$ immediately after the first rectangle search. We will say that the search is irregular
if $l_f < m/4n$ or $l_f$ is undefined, and otherwise is regular. For the purposes of bounding
costs we will assume that any irregular search is completely serial, but then show that the
probability of an irregular search is so low that the expected cost due to irregular searches
is $O((m^2/n^2) \log m)$. We accomplish this by showing that the probability of an irregular
search is $O(1/n^2)$.

Suppose that $l_f$ is defined, and equals $k < m/4n$. This implies that some $w$ drawn
from a convolution of $k + 1$ modules actually satisfies the probe. For simplicity we assume
that $w \sim M(k+1)$, although this is not rigorously true. The probability that an $M(k+1)$
random variable satisfies the probe is no greater than the probability that $w \sim M(m/4n)$
satisfies a new probe which passes automatically if $w \ge m\mu_m/2n$, and which calls the
original probe otherwise. The new probe is constructed only for the purpose of bounding
probabilities. The probability of the new probe passing automatically is the probability
that $w \ge m\mu_m/2n$; but since $w \sim M(m/4n)$, Lemma 5(a) says this probability is $O(4n/m)$.
As $8n^3 \le m$, the probability of the new probe passing automatically is $O(1/n^2)$. The new
probe is also satisfied if $w < m\mu_m/2n$ and the old probe passes $w$. A necessary condition
for the old probe to pass $w$ is that each of $n$ processors receives a load less than or equal
to $w$. This implies that the sum of all module weights can be no greater than $nw$. Given
that $w < m\mu_m/2n$, the sum of all module weights can be no greater than $m\mu_m/2$. But by
Lemma 5(b) the probability of this occurring is $O(1/m)$. Finally we consider the possibility
that $l_f$ is not defined. For this to occur the least weight on the first strip must pass
the probe, a weight composed of a single module weight. The same types of arguments
as used above establish that the chance of this occurrence is infinitesimal.
Consequently, the chance of an irregular search is $O(1/n^2)$.

Now we show that the expected complexity of a regular search is $O((m^2/n) \log m)$.
Since the search is regular we have $l_f \ge m/4n$. Let $w$ be a weight touched by the search.
Two cases may occur.

Case 1 Suppose that $w$ is composed of $k$ module weights, and $k \le m/8n$. For simplicity
we assume that $w \sim M(k)$. A necessary condition for actually calling the probe
function is that $M(k)$ exceed $V_f$, taken at its value immediately after the first
rectangle evaluation. $w$ is not independent of $V_f$, but we will assume so for the sake
of tractability. The probability of calling the probe function is then bounded by the
probability that a convolution $M_1(m/8n)$ exceeds another independent convolution
$M_2(m/4n)$. By Lemma 5(c) the probability of this occurring is $O(n/m) = O(1/n^2)$. If
we assume that an actual probe call must serialize because $w$ is too small, then the
expected cost due to this occurrence is only $O((m^2/n^2) \log m)$.

Case 2 Suppose that $w$ is composed of $k$ module weights, and $k > m/8n$. For the purposes
of bounding costs, suppose that if $w < m\mu_m/16n$ then the search serializes. By
Lemma 5(b) the probability of this is $O(n/m) = O(1/n^2)$, and the expected cost
of serialization in this fashion is $O((m^2/n^2) \log m)$. But if $w \ge m\mu_m/16n$, then
$w \ge n^2\mu_m/2$ because we have assumed $8n^3 \le m$. By Theorems 1 and 4 the probability
of serialization is only $O(1/n)$.

Finally, we must consider the behavior of the search during the first rectangle evaluation.
While unlikely, the worst case occurs if each of $\log m$ probes serializes. The cost of
evaluating the first rectangle is then $O(m \log^2 m)$. However, when $m$ is sufficiently larger
than $n$ ($m^{2/3} > \log m$) this cost is dominated by the parallel $O((m^2/n) \log m)$ complexity.

The discussions above have shown that when $8n^3 \le m$ the overall expected time
cost due to serialization is $O((m^2/n) \log m)$. The expected cost in the absence of
serialization was also $O((m^2/n) \log m)$, making this expression the overall expected time
complexity. The space required for the parallel probe is only $O(m)$.
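The thresholds used in the case analysis all follow from the standing assumption $8n^3 \le m$. The arithmetic, which the text leaves implicit, is:

```latex
8n^3 \le m
  \;\Longrightarrow\; \frac{n}{m} \le \frac{1}{8n^2} = O\!\left(\frac{1}{n^2}\right),
  \qquad \frac{4n}{m} \le \frac{1}{2n^2} = O\!\left(\frac{1}{n^2}\right),

8n^3 \le m
  \;\Longrightarrow\; \frac{m\mu_m}{16n} \ge \frac{8n^3\mu_m}{16n} = \frac{n^2\mu_m}{2}.
```

The first line gives the $O(1/n^2)$ probabilities quoted from Lemma 5; the second shows that a probe weight of at least $m\mu_m/16n$ meets the hypothesis of Theorem 4.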

6 Host-Satellite Problem 

Our approach to the host-satellite problem is again modeled on Iqbal’s probing approach. 
For a given bottleneck value $w$ we apply a PROBE2-like function (from the linear array
problem) to each satellite chain. The bottleneck weights are all of the form $\Omega_{ij}$, where the
$\Omega$ function is identical to that of the linear array problem. This probe will load the satellite
with the feasible load which minimizes the A function. The unassigned load is given to 
the host, and the communication cost of breaking the chain is suffered by both the host 
and the satellite. The host’s cost is the sum of the n off-loaded subchains, the associated 
communication costs, plus some additional load H which it must always compute. Since 
each satellite minimized the load given to the host under the bottleneck constraint on
satellite loads, the host's load is minimized. The probe returns true if the host's load is no
greater than the bottleneck weight. As before, we will first improve upon the known serial
solutions, and then show how to parallelize the mapping algorithm. We will reduce the
serial time complexity to $O(\max\{nm \log n, n \log^2 m\})$, and find a parallel solution with
$O(\max\{nm, n \log m \max\{n, \log m\}\})$ complexity. When $m$ is sufficiently larger than $n$ the
$nm$ term will dominate; in this case the complexity is within a constant factor of optimal
under the assumption that $O(nm)$ time is required to load the problem onto the host-satellite
system.

6.1 An Improved Probing Approach 

The set of bottleneck weights for the host-satellite problem has a different structure than
that of the previous two problems, but it is still exploitable. The bottleneck weights for
a given chain are of the form $C_0 + S_{1j} + C_j = \Omega_{1j}$, and consequently are not necessarily
monotone increasing in $j$. It is important to remember that each chain has its own set of $\Omega$
values. To allow the possibility of moving a satellite's entire chain onto the host we define
$\Omega_{10} = C_0$, where $C_0$ is the communication cost of transmitting the satellite's incoming
data to the host. The assumption that communication costs are bounded allows us to
sort a chain's bottleneck values in $O(m)$ time, using brute force. Define arrays right_less
and left_greater, each with $m + 1$ entries, and all entries initialized to zero. At the end
of the algorithm left_greater($j$) will contain the number of bottleneck values $\Omega_{1i}$ such that
$i < j$ and $\Omega_{1i} > \Omega_{1j}$. Similarly, right_less($j$) will contain the number of bottleneck
values $\Omega_{1k}$ such that $k > j$ and $\Omega_{1k} < \Omega_{1j}$. $\Omega_{1j}$'s rank (rank 0 meaning smallest) in
the sorted list is consequently right_less($j$) $+ j -$ left_greater($j$). The trick is to efficiently
compute the auxiliary arrays. For every $j = 0, \ldots, m$ we scan increasing values of $\Omega_{1k}$,
$k > j$, incrementing right_less($j$) and left_greater($k$) every time we encounter a $k$ such that
$\Omega_{1k} < \Omega_{1j}$. The important point is that we may stop scanning as soon as $k$ is so large
that $C_j < S_{jk}$, because we are assured that $\Omega_{1k}$ for larger $k$ is always larger than $\Omega_{1j}$.
Because the communication costs are bounded, these scans require constant time. Given
the ranks, the items can be sorted in $O(m)$ time. This gives the sorting algorithm an $O(m)$
complexity.
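The rank computation just described can be sketched in a few lines. This is a minimal illustration, not the paper's code: the function name `sort_bottleneck_values`, the argument layout (`weights[j-1]` for module $j$, `comm[j]` for $C_j$), and the reading of $S_{jk}$ as the weight of modules $j+1, \ldots, k$ are all assumptions made here. Weights and communication costs are assumed nonnegative.

```python
def sort_bottleneck_values(weights, comm):
    """Sort one chain's bottleneck values by counting inversions locally.

    weights: module weights for modules 1..m (assumed nonnegative).
    comm:    communication costs C_0..C_m (assumed nonnegative, bounded).
    Returns (omega, ordered): the values in chain order and in sorted order.
    """
    m = len(weights)
    prefix = [0] * (m + 1)              # prefix[j] = total weight of modules 1..j
    for j in range(1, m + 1):
        prefix[j] = prefix[j - 1] + weights[j - 1]

    # omega[0] = C_0 (whole chain on the host); omega[j] = C_0 + S_1j + C_j.
    omega = [comm[0]] + [comm[0] + prefix[j] + comm[j] for j in range(1, m + 1)]

    right_less = [0] * (m + 1)
    left_greater = [0] * (m + 1)
    for j in range(m + 1):
        for k in range(j + 1, m + 1):
            # Once the weight between j and k exceeds C_j, every later
            # omega[k] is larger than omega[j], so the scan can stop.
            if prefix[k] - prefix[j] > comm[j]:
                break
            if omega[k] < omega[j]:
                right_less[j] += 1
                left_greater[k] += 1

    ordered = [None] * (m + 1)
    for j in range(m + 1):
        rank = right_less[j] + j - left_greater[j]   # rank 0 is the smallest
        ordered[rank] = omega[j]
    return omega, ordered
```

Because only strict inequalities are counted, equal values receive distinct consecutive ranks, so every slot of `ordered` is filled exactly once. The inner loop runs a bounded number of steps only when module weights are bounded away from zero relative to the communication costs, which matches the boundedness assumption in the text.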

$O(nm)$ time is required to compute the auxiliary data structures for the probe function,
and to sort each of $n$ vectors of bottleneck values. The $n$ sorted vectors can be merged into
a single sorted list in $O(nm \log n)$ time. A binary search over the sorted list of bottleneck
values with a probe call at each touch has $O(n \log^2 m)$ complexity. As before, we must
also consider the next smallest bottleneck weight $b$ which passes the probe. $b$ must lie
adjacent to the bottleneck value found by the search and so is considered in constant
time. Depending on the relationship between $n$ and $m$, the overall complexity is either
$O(nm \log n)$ or $O(n \log^2 m)$; in either case an improvement over Bokhari's $O(nm^2 \log m)$
solution, or our $O(nm \log m)$ improvement upon Bokhari's solution.

6.2 A Parallel Approach 

The sorting step dominates the complexity of our serial algorithm. If we treat the host 
like a shared memory, then the satellites could conceivably sort the bottleneck values in 
parallel. However, in all likelihood a real host-satellite system will not emulate a shared- 
memory machine particularly efficiently, so that we should practically consider another 
approach. 

An easy way to exploit parallelism is to perform the probe function in parallel. The 
natural way to do this is to have each satellite call a PROBE2-like function on its own 
subchain structure. To support such an approach, each satellite is loaded with its own 
subchain costs. In parallel, each satellite sorts its own $\Omega$ values as previously described.
The probe values will be selected by performing a binary search over each satellite’s list 
of bottleneck weights; first we search the entire list of the first satellite, then the entire 
list of the second satellite, and so on. For every probe touch the host can query the 
appropriate satellite for the proper probe value, and then transmit that value to every 
satellite. Each satellite then calls a PROBE2-like function to determine the feasible load
which minimizes the remaining load (which is the host's cost), and reports the remaining
load to the host. The host computes its own load and determines whether the probe
passed or failed. Loading the problem onto the satellites takes $O(nm)$ time. Each parallel
probe call takes $O(\max\{n, \log m\})$ time; there are $O(n \log m)$ probe calls. The overall
parallel time complexity is $O(\max\{nm, n \log m \max\{n, \log m\}\})$. When $m$ is sufficiently
larger than $n$ the $O(nm)$ cost of loading the problem dominates. In this case the algorithm
is within a constant factor of optimal, if we assume that the time to load the problem onto
the host-satellite system is proportional to the problem size.
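The logic of a single probe call can be illustrated with a brute-force sketch. It is not the PROBE2-like routine of the paper: each satellite here scans all of its split points in $O(m)$ time rather than searching in $O(\log m)$, and the satellites are simulated by a serial loop. The function name `host_satellite_probe`, the data layout, and the treatment of `comm[m]` as the cost paid when a satellite keeps its whole chain are assumptions made for illustration.

```python
def host_satellite_probe(chains, host_base, w):
    """Return True if bottleneck value w admits a feasible mapping.

    chains:    list of (weights, comm) pairs, one per satellite; weights holds
               the m module weights and comm the m + 1 costs C_0..C_m.
    host_base: the fixed load H that the host must always compute.
    w:         the bottleneck value being probed.
    """
    host_load = host_base
    for weights, comm in chains:
        total = sum(weights)
        kept = 0                  # weight of modules 1..j kept on the satellite
        best_remainder = None     # least load this chain can hand to the host
        for j in range(len(weights) + 1):
            if j > 0:
                kept += weights[j - 1]
            # Satellite's bottleneck value for a cut after module j.
            omega = comm[0] if j == 0 else comm[0] + kept + comm[j]
            if omega <= w:
                # The host receives the rest of the chain plus the cut cost.
                remainder = (total - kept) + comm[j]
                if best_remainder is None or remainder < best_remainder:
                    best_remainder = remainder
        if best_remainder is None:
            return False          # no split keeps this satellite within w
        host_load += best_remainder
    return host_load <= w
```

In the parallel scheme each iteration of the outer loop would run on its own satellite, and only the `best_remainder` values would be sent to the host for the final comparison.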

7 Summary 

We have examined three parallel mapping problems: mapping a chain of modules onto a
linear array, mapping a chain of modules onto a shared memory machine, and mapping a set
of chains onto a host-satellite system. In each case we determine the mapping which minimizes
the computation's finishing time, subject to a contiguity constraint. These problems were
originally shown to be tractable by Bokhari in [4]. Our work builds on his by first showing 
that his solutions can immediately be improved by a factor of m (the number of modules), 
and then by demonstrating that there are much more efficient solutions than those that 
demonstrated the problems’ tractability. In addition, we showed how the target parallel 
architectures themselves can be used to compute the optimal mapping. In some cases we 
showed that algorithms with bad worst case complexity have good average case complexity. 
The table below compares the time complexities of Bokhari's original algorithms, our
improvement on those algorithms, Iqbal's approximation methods, and our improved serial
and parallel methods. In some cases we have simplified complexities by assuming that m is
much larger than n.


Problem          Bokhari       Improved      Iqbal               Improved serial    Improved parallel
                               Bokhari       (approximate)
-----------------------------------------------------------------------------------------------------
Linear array     nm^3          nm^2          mn log(W_T/ε)       nm log m           m log m log n
Shared memory    nm^3 log m    nm^2 log m    m^2 log(W_T/ε)      m^2 log m          (m^2/n) log m
                                                                 (amortized)        (expected, amortized)
Host-satellite   nm^2 log m    nm log m      nm log(W_T/ε)       nm log n           nm